Guided Project: Designing and Creating a Database
Posted on Wed 08 July 2015 in Projects
import sqlite3
import pandas as pd
import csv
pd.set_option('max_columns', 180)
pd.set_option('max_rows', 200000)
pd.set_option('max_colwidth', 5000)
Getting to Know the Data¶
log = pd.read_csv("game_log.csv",low_memory=False)
print(log.shape)
log.head()
log.tail()
It looks like the game log has a record of over 170,000 games. It looks like these games are chronologically ordered and occur between 1871 and 2016.
For each game we have:
- general information on the game
- team level stats for each team
- a list of players from each team, numbered, with their defensive positions
- the umpires that officiated the game
- some 'awards', like winning and losing pitcher
We have a game_log_fields.txt
file that tell us that the player number corresponds with the order in which they batted.
It's worth noting that there is no natural primary key column for this table.
person = pd.read_csv('person_codes.csv')
print(person.shape)
person.head()
This seems to be a list of people with IDs. The IDs look like they match up with those used in the game log. There are debut dates, for players, managers, coaches and umpires. We can see that some people might have been one or more of these roles.
It also looks like coaches and managers are two different things in baseball. After some research, managers are what would be called a 'coach' or 'head coach' in other sports, and coaches are more specialized, like base coaches. It also seems like coaches aren't recorded in the game log.
park = pd.read_csv('park_codes.csv')
print(park.shape)
park.head()
This seems to be a list of all baseball parks. There are IDs which seem to match with the game log, as well as names, nicknames, city and league.
team = pd.read_csv('team_codes.csv')
print(team.shape)
team.head()
This seems to be a list of all teams, with team_ids which seem to match the game log. Interestingly, there is a franch_id
, let's take a look at this:
team["franch_id"].value_counts().head()
We might have franch_id
occurring a few times for some teams, let's look at the first one in more detail.
team[team["franch_id"] == 'BS1']
It appears that teams move between leagues and cities. The team_id changes when this happens, franch_id
(which is probably 'Franchise') helps us tie all of this together.
Defensive Positions
In the game log, each player has a defensive position listed, which seems to be a number between 1-10. Doing some research around this, I found this article which gives us a list of names for each numbered position:
- Pitcher
- Catcher
- 1st Base
- 2nd Base
- 3rd Base
- Shortstop
- Left Field
- Center Field
- Right Field
The 10th position isn't included, it may be a way of describing a designated hitter that does not field. I can find a retrosheet page that indicates that position 0
is used for this, but we don't have any position 0 in our data. I have chosen to make this an 'Unknown Position' so I'm not including data based on a hunch.
Leagues
Wikipedia tells us there are currently two leagues - the American (AL) and National (NL). Let's start by finding out what leagues are listed in the main game log:
log["h_league"].value_counts()
It looks like most of our games fall into the two current leagues, but that there are four other leagues. Let's write a quick function to get some info on the years of these leagues:
def league_info(league):
league_games = log[log["h_league"] == league]
earliest = league_games["date"].min()
latest = league_games["date"].max()
print("{} went from {} to {}".format(league,earliest,latest))
for league in log["h_league"].unique():
league_info(league)
Now we have some years which will help us do some research. After some googling we come up with:
NL
: National LeagueAL
: American LeagueAA
: American AssociationFL
: Federal LeaguePL
: Players LeagueUA
: Union Association
It also looks like we have about 1000 games where the home team doesn't have a value for league.
Importing Data into SQLite¶
# These helper functions will be useful as we work
# with the SQLite database from python
DB = "mlb.db"
def run_query(q):
with sqlite3.connect(DB) as conn:
return pd.read_sql(q,conn)
def run_command(c):
with sqlite3.connect(DB) as conn:
conn.execute('PRAGMA foreign_keys = ON;')
conn.isolation_level = None
conn.execute(c)
def show_tables():
q = '''
SELECT
name,
type
FROM sqlite_master
WHERE type IN ("table","view");
'''
return run_query(q)
tables = {
"game_log": log,
"person_codes": person,
"team_codes": team,
"park_codes": park
}
with sqlite3.connect(DB) as conn:
for name, data in tables.items():
conn.execute("DROP TABLE IF EXISTS {};".format(name))
data.to_sql(name,conn,index=False)
show_tables()
c1 = """
ALTER TABLE game_log
ADD COLUMN game_id TEXT;
"""
# try/except loop since ALTER TABLE
# doesn't support IF NOT EXISTS
try:
run_command(c1)
except:
pass
c2 = """
UPDATE game_log
SET game_id = date || h_name || number_of_game
/* WHERE prevents this if it has already been done */
WHERE game_id IS NULL;
"""
run_command(c2)
q = """
SELECT
game_id,
date,
h_name,
number_of_game
FROM game_log
LIMIT 5;
"""
run_query(q)
Looking for Normalization Opportunities¶
The following are opportunities for normalization of our data:
- In
person_codes
, all the debut dates will be able to be reproduced using game log data. - In
team_codes
, the start, end and sequence columns will be able to be reproduced using game log data. - In
park_codes
, the start and end years will be able to be reproduced using game log data. While technically the state is an attribute of the city, we might not want to have a an incomplete city/state table so we will leave this in. - There are lots of places in
game
log where we have a player ID followed by the players name. We will be able to remove this and use the name data inperson_codes
- In
game_log
, all offensive and defensive stats are repeated for the home team and the visiting team. We could break these out and have a table that lists each game twice, one for each team, and cut out this column repetition. - Similarly, in
game_log
, we have a listing for 9 players on each team with their positions - we can remove these and have one table that tracks player appearances and their positions. - We can do a similar thing with the umpires from
game_log
, instead of listing all four positions as columns, we can put the umpires either in their own table or make one table for players, umpires and managers. - We have several awards in
game_log
like winning pitcher and losing pitcher. We can either break these out into their own table, have a table for awards, or combine the awards in with general appearances like the players and umpires.
Planning a Normalized Schema¶
The following schema was planned using DbDesigner.net:
Creating Tables Without Foreign Keys¶
c1 = """
CREATE TABLE IF NOT EXISTS person (
person_id TEXT PRIMARY KEY,
first_name TEXT,
last_name TEXT
);
"""
c2 = """
INSERT OR IGNORE INTO person
SELECT
id,
first,
last
FROM person_codes;
"""
q = """
SELECT * FROM person
LIMIT 5;
"""
run_command(c1)
run_command(c2)
run_query(q)
c1 = """
CREATE TABLE IF NOT EXISTS park (
park_id TEXT PRIMARY KEY,
name TEXT,
nickname TEXT,
city TEXT,
state TEXT,
notes TEXT
);
"""
c2 = """
INSERT OR IGNORE INTO park
SELECT
park_id,
name,
aka,
city,
state,
notes
FROM park_codes;
"""
q = """
SELECT * FROM park
LIMIT 5;
"""
run_command(c1)
run_command(c2)
run_query(q)
c1 = """
CREATE TABLE IF NOT EXISTS league (
league_id TEXT PRIMARY KEY,
name TEXT
);
"""
c2 = """
INSERT OR IGNORE INTO league
VALUES
("NL", "National League"),
("AL", "American League"),
("AA", "American Association"),
("FL", "Federal League"),
("PL", "Players League"),
("UA", "Union Association")
;
"""
q = """
SELECT * FROM league
"""
run_command(c1)
run_command(c2)
run_query(q)
c1 = "DROP TABLE IF EXISTS appearance_type;"
run_command(c1)
c2 = """
CREATE TABLE appearance_type (
appearance_type_id TEXT PRIMARY KEY,
name TEXT,
category TEXT
);
"""
run_command(c2)
appearance_type = pd.read_csv('appearance_type.csv')
with sqlite3.connect('mlb.db') as conn:
appearance_type.to_sql('appearance_type',
conn,
index=False,
if_exists='append')
q = """
SELECT * FROM appearance_type;
"""
run_query(q)
Adding The Team and Game Tables¶
c1 = """
CREATE TABLE IF NOT EXISTS team (
team_id TEXT PRIMARY KEY,
league_id TEXT,
city TEXT,
nickname TEXT,
franch_id TEXT,
FOREIGN KEY (league_id) REFERENCES league(league_id)
);
"""
c2 = """
INSERT OR IGNORE INTO team
SELECT
team_id,
league,
city,
nickname,
franch_id
FROM team_codes;
"""
q = """
SELECT * FROM team
LIMIT 5;
"""
run_command(c1)
run_command(c2)
run_query(q)
c1 = """
CREATE TABLE IF NOT EXISTS game (
game_id TEXT PRIMARY KEY,
date TEXT,
number_of_game INTEGER,
park_id TEXT,
length_outs INTEGER,
day BOOLEAN,
completion TEXT,
forefeit TEXT,
protest TEXT,
attendance INTEGER,
legnth_minutes INTEGER,
additional_info TEXT,
acquisition_info TEXT,
FOREIGN KEY (park_id) REFERENCES park(park_id)
);
"""
c2 = """
INSERT OR IGNORE INTO game
SELECT
game_id,
date,
number_of_game,
park_id,
length_outs,
CASE
WHEN day_night = "D" THEN 1
WHEN day_night = "N" THEN 0
ELSE NULL
END
AS day,
completion,
forefeit,
protest,
attendance,
length_minutes,
additional_info,
acquisition_info
FROM game_log;
"""
q = """
SELECT * FROM game
LIMIT 5;
"""
run_command(c1)
run_command(c2)
run_query(q)
Adding the Team Appearance Table¶
c1 = """
CREATE TABLE IF NOT EXISTS team_appearance (
team_id TEXT,
game_id TEXT,
home BOOLEAN,
league_id TEXT,
score INTEGER,
line_score TEXT,
at_bats INTEGER,
hits INTEGER,
doubles INTEGER,
triples INTEGER,
homeruns INTEGER,
rbi INTEGER,
sacrifice_hits INTEGER,
sacrifice_flies INTEGER,
hit_by_pitch INTEGER,
walks INTEGER,
intentional_walks INTEGER,
strikeouts INTEGER,
stolen_bases INTEGER,
caught_stealing INTEGER,
grounded_into_double INTEGER,
first_catcher_interference INTEGER,
left_on_base INTEGER,
pitchers_used INTEGER,
individual_earned_runs INTEGER,
team_earned_runs INTEGER,
wild_pitches INTEGER,
balks INTEGER,
putouts INTEGER,
assists INTEGER,
errors INTEGER,
passed_balls INTEGER,
double_plays INTEGER,
triple_plays INTEGER,
PRIMARY KEY (team_id, game_id),
FOREIGN KEY (team_id) REFERENCES team(team_id),
FOREIGN KEY (game_id) REFERENCES game(game_id),
FOREIGN KEY (team_id) REFERENCES team(team_id)
);
"""
run_command(c1)
c2 = """
INSERT OR IGNORE INTO team_appearance
SELECT
h_name,
game_id,
1 AS home,
h_league,
h_score,
h_line_score,
h_at_bats,
h_hits,
h_doubles,
h_triples,
h_homeruns,
h_rbi,
h_sacrifice_hits,
h_sacrifice_flies,
h_hit_by_pitch,
h_walks,
h_intentional_walks,
h_strikeouts,
h_stolen_bases,
h_caught_stealing,
h_grounded_into_double,
h_first_catcher_interference,
h_left_on_base,
h_pitchers_used,
h_individual_earned_runs,
h_team_earned_runs,
h_wild_pitches,
h_balks,
h_putouts,
h_assists,
h_errors,
h_passed_balls,
h_double_plays,
h_triple_plays
FROM game_log
UNION
SELECT
v_name,
game_id,
0 AS home,
v_league,
v_score,
v_line_score,
v_at_bats,
v_hits,
v_doubles,
v_triples,
v_homeruns,
v_rbi,
v_sacrifice_hits,
v_sacrifice_flies,
v_hit_by_pitch,
v_walks,
v_intentional_walks,
v_strikeouts,
v_stolen_bases,
v_caught_stealing,
v_grounded_into_double,
v_first_catcher_interference,
v_left_on_base,
v_pitchers_used,
v_individual_earned_runs,
v_team_earned_runs,
v_wild_pitches,
v_balks,
v_putouts,
v_assists,
v_errors,
v_passed_balls,
v_double_plays,
v_triple_plays
from game_log;
"""
run_command(c2)
q = """
SELECT * FROM team_appearance
WHERE game_id = (
SELECT MIN(game_id) from game
)
OR game_id = (
SELECT MAX(game_id) from game
)
ORDER By game_id, home;
"""
run_query(q)
Adding the Person Appearance Table¶
c0 = "DROP TABLE IF EXISTS person_appearance"
run_command(c0)
c1 = """
CREATE TABLE person_appearance (
appearance_id INTEGER PRIMARY KEY,
person_id TEXT,
team_id TEXT,
game_id TEXT,
appearance_type_id,
FOREIGN KEY (person_id) REFERENCES person(person_id),
FOREIGN KEY (team_id) REFERENCES team(team_id),
FOREIGN KEY (game_id) REFERENCES game(game_id),
FOREIGN KEY (appearance_type_id) REFERENCES appearance_type(appearance_type_id)
);
"""
c2 = """
INSERT OR IGNORE INTO person_appearance (
game_id,
team_id,
person_id,
appearance_type_id
)
SELECT
game_id,
NULL,
hp_umpire_id,
"UHP"
FROM game_log
WHERE hp_umpire_id IS NOT NULL
UNION
SELECT
game_id,
NULL,
[1b_umpire_id],
"U1B"
FROM game_log
WHERE "1b_umpire_id" IS NOT NULL
UNION
SELECT
game_id,
NULL,
[2b_umpire_id],
"U2B"
FROM game_log
WHERE [2b_umpire_id] IS NOT NULL
UNION
SELECT
game_id,
NULL,
[3b_umpire_id],
"U3B"
FROM game_log
WHERE [3b_umpire_id] IS NOT NULL
UNION
SELECT
game_id,
NULL,
lf_umpire_id,
"ULF"
FROM game_log
WHERE lf_umpire_id IS NOT NULL
UNION
SELECT
game_id,
NULL,
rf_umpire_id,
"URF"
FROM game_log
WHERE rf_umpire_id IS NOT NULL
UNION
SELECT
game_id,
v_name,
v_manager_id,
"MM"
FROM game_log
WHERE v_manager_id IS NOT NULL
UNION
SELECT
game_id,
h_name,
h_manager_id,
"MM"
FROM game_log
WHERE h_manager_id IS NOT NULL
UNION
SELECT
game_id,
CASE
WHEN h_score > v_score THEN h_name
ELSE v_name
END,
winning_pitcher_id,
"AWP"
FROM game_log
WHERE winning_pitcher_id IS NOT NULL
UNION
SELECT
game_id,
CASE
WHEN h_score < v_score THEN h_name
ELSE v_name
END,
losing_pitcher_id,
"ALP"
FROM game_log
WHERE losing_pitcher_id IS NOT NULL
UNION
SELECT
game_id,
CASE
WHEN h_score > v_score THEN h_name
ELSE v_name
END,
saving_pitcher_id,
"ASP"
FROM game_log
WHERE saving_pitcher_id IS NOT NULL
UNION
SELECT
game_id,
CASE
WHEN h_score > v_score THEN h_name
ELSE v_name
END,
winning_rbi_batter_id,
"AWB"
FROM game_log
WHERE winning_rbi_batter_id IS NOT NULL
UNION
SELECT
game_id,
v_name,
v_starting_pitcher_id,
"PSP"
FROM game_log
WHERE v_starting_pitcher_id IS NOT NULL
UNION
SELECT
game_id,
h_name,
h_starting_pitcher_id,
"PSP"
FROM game_log
WHERE h_starting_pitcher_id IS NOT NULL;
"""
template = """
INSERT INTO person_appearance (
game_id,
team_id,
person_id,
appearance_type_id
)
SELECT
game_id,
{hv}_name,
{hv}_player_{num}_id,
"O{num}"
FROM game_log
WHERE {hv}_player_{num}_id IS NOT NULL
UNION
SELECT
game_id,
{hv}_name,
{hv}_player_{num}_id,
"D" || CAST({hv}_player_{num}_def_pos AS INT)
FROM game_log
WHERE {hv}_player_{num}_id IS NOT NULL;
"""
run_command(c1)
run_command(c2)
for hv in ["h","v"]:
for num in range(1,10):
query_vars = {
"hv": hv,
"num": num
}
run_command(template.format(**query_vars))
print(run_query("SELECT COUNT(DISTINCT game_id) games_game FROM game"))
print(run_query("SELECT COUNT(DISTINCT game_id) games_person_appearance FROM person_appearance"))
q = """
SELECT
pa.*,
at.name,
at.category
FROM person_appearance pa
INNER JOIN appearance_type at on at.appearance_type_id = pa.appearance_type_id
WHERE PA.game_id = (
SELECT max(game_id)
FROM person_appearance
)
ORDER BY team_id, appearance_type_id
"""
run_query(q)
Removing the Original Tables¶
show_tables()
tables = [
"game_log",
"park_codes",
"team_codes",
"person_codes"
]
for t in tables:
c = '''
DROP TABLE {}
'''.format(t)
run_command(c)
show_tables()